AITopics | benchmark evaluation

benchmark evaluation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Preliminary suggestions for rigorous GPAI model evaluations

Paskov, Patricia, Byun, Michael J., Wei, Kevin, Webster, Toby

arXiv.org Artificial Intelligence, Aug-20-2025

This document presents a preliminary compilation of general-purpose AI (GPAI) evaluation practices that may promote internal validity, external validity and reproducibility. It includes suggestions for human uplift studies and benchmark evaluations, as well as cross-cutting suggestions that may apply to many different evaluation types. Suggestions are organised across four stages in the evaluation life cycle: design, implementation, execution and documentation. Drawing from established practices in machine learning, statistics, psychology, economics, biology and other fields recognised to have important lessons for AI evaluation, these suggestions seek to contribute to the conversation on the nascent and evolving field of the science of GPAI evaluations. The intended audience of this document includes providers of GPAI models presenting systemic risk (GPAISR), for whom the EU AI Act lays out specific evaluation requirements; third-party evaluators; policymakers assessing the rigour of evaluations; and academic researchers developing or conducting GPAI evaluations.

large language model, machine learning, natural language, (19 more...)

doi: 10.7249/PEA3971-1

2508.00875

Country:

North America > United States (1.00)
Europe (1.00)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.68)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.94)
Government > Regional Government > North America Government > United States Government (0.93)
Education (0.93)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.46)

In-the-loop Hyper-Parameter Optimization for LLM-Based Automated Design of Heuristics

van Stein, Niki, Vermetten, Diederick, Bäck, Thomas

arXiv.org Artificial Intelligence, Oct-7-2024

Large Language Models (LLMs) have shown great potential in automatically generating and optimizing (meta)heuristics, making them valuable tools in heuristic optimization tasks. However, LLMs are generally inefficient when it comes to fine-tuning hyper-parameters of the generated algorithms, often requiring excessive queries that lead to high computational and financial costs. This paper presents a novel hybrid approach, LLaMEA-HPO, which integrates the open source LLaMEA (Large Language Model Evolutionary Algorithm) framework with a Hyper-Parameter Optimization (HPO) procedure in the loop. By offloading hyper-parameter tuning to an HPO procedure, the LLaMEA-HPO framework allows the LLM to focus on generating novel algorithmic structures, reducing the number of required LLM queries and improving the overall efficiency of the optimization process. We empirically validate the proposed hybrid framework on benchmark problems, including Online Bin Packing, Black-Box Optimization, and the Traveling Salesperson Problem. Our results demonstrate that LLaMEA-HPO achieves superior or comparable performance compared to existing LLM-driven frameworks while significantly reducing computational costs. This work highlights the importance of separating algorithmic innovation and structural code search from parameter tuning in LLM-driven code optimization and offers a scalable approach to improve the efficiency and effectiveness of LLM-based code generation.

artificial intelligence, large language model, natural language, (17 more...)

2410.16309

Country: Europe > Netherlands (0.15)

Genre: Research Report > New Finding (0.86)

Industry: Transportation (0.49)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
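
The separation the abstract describes, where one component proposes algorithm structures and a separate procedure tunes their hyper-parameters, can be illustrated with a minimal sketch. This is not the authors' implementation: a fixed list of two toy solvers stands in for the LLM's proposals, plain random search stands in for the HPO procedure, and every name here (`propose_structures`, `tune`, `search`, the toy `objective`) is hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]
Space = Dict[str, Tuple[float, float]]
Algorithm = Callable[[Params, random.Random], float]


def objective(x: float) -> float:
    """Toy objective to minimise (assumption for illustration); minimum at x = 3."""
    return (x - 3.0) ** 2


def hill_climber(params: Params, rng: random.Random) -> float:
    """Accept a random step only if it improves the objective."""
    x = 0.0
    for _ in range(50):
        candidate = x + rng.uniform(-params["step"], params["step"])
        if objective(candidate) < objective(x):
            x = candidate
    return objective(x)


def random_sampler(params: Params, rng: random.Random) -> float:
    """Sample uniformly in [-width, width] and keep the best value seen."""
    best = float("inf")
    for _ in range(50):
        best = min(best, objective(rng.uniform(-params["width"], params["width"])))
    return best


def propose_structures() -> List[Tuple[str, Algorithm, Space]]:
    """Hypothetical stand-in for the LLM: returns candidate structures,
    each with its own hyper-parameter search space."""
    return [
        ("hill_climber", hill_climber, {"step": (0.01, 2.0)}),
        ("random_sampler", random_sampler, {"width": (0.5, 20.0)}),
    ]


def tune(algorithm: Algorithm, space: Space, budget: int, seed: int) -> Tuple[Params, float]:
    """Random-search HPO (a stand-in for a real HPO tool): no LLM queries are spent here."""
    rng = random.Random(seed)
    best_params: Params = {}
    best_score = float("inf")
    for _ in range(budget):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        # Re-seed the solver so configurations are compared on the same random stream.
        score = algorithm(params, random.Random(seed))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def search(budget_per_structure: int = 30, seed: int = 0) -> Tuple[str, Params, float]:
    """Outer loop over structures, inner loop over hyper-parameters."""
    results = []
    for name, algorithm, space in propose_structures():
        params, score = tune(algorithm, space, budget_per_structure, seed)
        results.append((name, params, score))
    return min(results, key=lambda r: r[2])


if __name__ == "__main__":
    name, params, score = search()
    print(f"best structure: {name}, params: {params}, score: {score:.4f}")
```

In this sketch the number of structure proposals is independent of the tuning budget, which is the property the abstract attributes to the hybrid approach: tuning effort is spent in the inner loop rather than on additional LLM queries.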

How to Evaluate Different Machine Learning Deployment Solutions

#artificialintelligence, Apr-19-2022, 15:30:26 GMT

Reach out to us at deployML@wallaroo.ai for a free evaluation. The emergence of Big Data in decision-making to achieve strategic business objectives has made machine learning (ML) a key enabler for driving growth, achieving operational excellence, and bringing innovative products to market. This shift has come about as the primary obstacles for ML are being overcome: data engineering at scale and model development are no longer daunting to enterprises, given the many efficient and simple solutions provided by cloud or third-party vendors. As a result, ML has gone from something only bleeding-edge innovators (such as Netflix and Amazon) were doing to a strategic enabler for organizations in the "early majority" stage of adoption. However, enterprises soon find that building a machine learning model isn't the end of the road but just the beginning of a new set of challenges. Because this is all so new, most enterprises do not have a predefined set of parameters for evaluating the different solutions for operationalizing ML models. As a result, they are not sure which attributes will allow their AI-enabled products and operations to scale in the long term as they add more models, use more data, or build more complex models.

benchmark evaluation, different machine learning deployment solution, evaluation, (4 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
